Building small scale models of multi-entity databases by clustering
Abstract
A framework is proposed to build small scale models of very large databases describing several entities and their relationships. In the first part, it is shown that sampling is not a good solution when several entities are stored in a database. In the second part, a model is proposed which is based on clustering all entities of the database and storing aggregates on the clusters and on the relationships between the clusters. The last part of the paper discusses the different problems opened by this approach. Some solutions are proposed; in particular, the link with symbolic data analysis is established.
Introduction and motivation
Every day, more and more data are generated by computers in all fields of activity. Operational databases create and update detailed data for management purposes. Data from operational databases are transferred into data warehouses when they need to be used for decision aid purposes. In some cases data are summarized, usually by aggregation processes, when loaded into the data warehouse, but in many cases detailed data are kept. This leads to very large amounts of data in data warehouses, especially because historical data are kept. On the other hand, many analyses run on data warehouses do not need such detailed data: data cubes (i.e. n-way arrays) are often used at a very aggregated level, and data mining or data analysis methods only use aggregated data.
The goal of this paper is to discuss methods for reducing the volume of data in data warehouses while preserving the possibility to perform the needed analyses. An important issue in databases and data warehouses is that they describe several entities (populations) which are linked together by relationships. This paper tackles this fundamental aspect of databases and proposes solutions to deal with it.
The paper is organized as follows. It first presents related work in the fields of both databases and statistics. It then describes how several entities and their relationships are stored in databases and data warehouses. Next, it is shown that the use of sampling is not appropriate, for several reasons. The following section presents the model we propose (Hébrail and Lechevallier) for building small scale models (SSM) of multi-entity databases: the model is based on a clustering of all entities and a storage of information on the clusters instead of the detailed entities. A further section discusses the main problems raised by this approach: the choice of the clustering approach, the use of the SSM, and the updatability of the SSM. The final section establishes a link between this work and the approach of symbolic data analysis proposed by Diday.
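The idea of storing aggregates on clusters and on the relationships between clusters can be illustrated with a minimal sketch. It is not the authors' implementation: the toy entities (`customers`, `products`), the `purchases` relationship, the threshold-based stand-in for a real clustering method, and the choice of count and mean as aggregates are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy database: two entities and one relationship between them.
customers = {1: 23, 2: 27, 3: 61, 4: 65}        # customer id -> age
products = {"a": 10.0, "b": 12.0, "c": 95.0}    # product id -> price
purchases = [(1, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "c"), (4, "b")]


def cluster_by_threshold(values, threshold):
    """Stand-in for a real clustering method: cluster 0 below the threshold, 1 otherwise."""
    return {key: int(value >= threshold) for key, value in values.items()}


def cluster_aggregates(values, assignment):
    """Keep only a count and a mean per cluster instead of the detailed entities."""
    groups = defaultdict(list)
    for key, cluster in assignment.items():
        groups[cluster].append(values[key])
    return {c: {"count": len(v), "mean": mean(v)} for c, v in groups.items()}


def relationship_aggregates(links, left_assignment, right_assignment):
    """Count the links observed between each pair of clusters."""
    return Counter((left_assignment[a], right_assignment[b]) for a, b in links)


customer_cluster = cluster_by_threshold(customers, 40)
product_cluster = cluster_by_threshold(products, 50.0)

# The small scale model holds aggregates only; detailed rows are not kept.
ssm = {
    "customers": cluster_aggregates(customers, customer_cluster),
    "products": cluster_aggregates(products, product_cluster),
    "purchases": relationship_aggregates(purchases, customer_cluster, product_cluster),
}

if __name__ == "__main__":
    for name, summary in ssm.items():
        print(name, dict(summary))
```

In this sketch the model size depends on the number of clusters and cluster pairs, not on the number of stored entities, while the total number of links remains recoverable from the relationship counts.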
Similar resources
Evaluation of Updating Methods in Building Blocks Dataset
With the increasing use of spatial data in daily life, the production of this data from diverse information sources with different precision and scales has grown widely. Generating new data requires a great deal of time and money. Therefore, one solution to reduce costs is to update the old data at different scales using new data (produced on a similar scale). One approach to updating data i...
(Inductive) Querying Environment for Predictive Clustering Trees
Inductive databases tightly integrate databases with data mining. Besides data, an inductive database also stores models that have been obtained by running data mining algorithms on the data. By means of a querying environment, the user can query the database and retrieve particular models. In this paper, we propose such a querying environment. It can be used for building new models and for sea...
Optimization of majority protocol for controlling transactions concurrency in distributed databases by multi-agent systems
In this paper, we propose a new concurrency control algorithm based on multi-agent systems which is an extension of majority protocol. Then, we suggest a clustering approach to get better results in reliability, decreasing message passing and algorithm’s runtime. Here, we consider n different transactions working on non-conflict data items. Considering execution efficiency of some different...
Propagating Updates of Residential Areas in Multi-Representation Databases Using Constrained Delaunay Triangulations
Updating topographic maps in multi-representation databases is crucial to a number of applications. An efficient way to update topographic maps is to propagate the updates from large-scale maps to small-scale maps. Because objects are often portrayed differently in maps of different scales, it is a complicated process to produce multi-scale topographic maps that meet specific cartographical cri...
Mining the Banking Customer Behavior Using Clustering and Association Rules Methods
The unprecedented growth of competition in banking technology has raised the importance of retaining current customers and acquiring new ones, so analyzing customer behavior based on bank databases is important. Analyzing bank databases for customer behavior is difficult since bank databases are multi-dimensional, comprised of monthly account records and daily t...